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Abstract 

Kernel density estimation, a.k.a. Parzen windows, is a popular density estimation method, 
which can be used for outlier detection or clustering. With multivariate data, its performance is 
heavily reliant on the metric used within the kernel. Most earlier work has focused on learning 
only the bandwidth of the kernel (i.e., a scalar multiplicative factor). In this paper, we propose 
to learn a full Euclidean metric through an expectation-minimisation (EM) procedure, which 
can be seen as an unsupervised counterpart to neighbourhood component analysis (NCA). In 
order to avoid overfitting with a fully nonparametric density estimator in high dimensions, we 
also consider a semi-parametric Gaussian-Parzen density model, where some of the variables are 
modelled through a jointly Gaussian density, while others are modelled through Parzen windows. 
For these two models, EM leads to simple closed-form updates based on matrix inversions and 
eigenvalue decompositions. We show empirically that our method leads to density estimators 
with higher test-likelihoods than natural competing methods, and that the metrics may be 
used within most unsupervised learning techniques that rely on local distances, such as spectral 
clustering or manifold learning methods. Finally, we present a stochastic approximation scheme 
which allows for the use of this method in a large-scale setting. 



1 Introduction 

Most unsupervised learning methods rely on a metric on the space of observations. The quality of the 
metric directly impacts the performance of such techniques and a significant amount of work has been 
dedicated to learning this metric from data when some supervised information is available [XN JR02 , 
GRHS04. BJ04 . However, in a fully unsupervised scenario, most practitioners use the Mahalanobis 
distance obtained from principal component analysis (PCA). This is an unsatisfactory solution as 
PCA is essentially a global linear dimension reduction method, while most unsupervised learning 
techniques, such as spectral clustering or manifold learning, are local. 

In this paper, we cast the unsupervised metric learning as a density estimation problem with a 
Parzen windows estimator based on a Euclidean metric. Using the maximum likelihood framework, 
we derive in Section[3]an expectation-minimisation (EM) procedure that maximizes the leave-one-out 
log-likelihood, which may be considered as an unsupervised counterpart to neighbourhood component 
analysis (NCA) [GRHS04 . As opposed to PCA, which performs a whitening of the data based on 
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global information, our new algorithm globally performs a whitening of the data using only local 
information, hence the denomination local component analysis (LCA). 

Like all non-parametric density estimators, Parzen windows density estimation is known to overfit in 
high dimensions [Sil86,, and thus LCA should also overfit. In order to keep the modelling flexibility 
of our density estimator while avoiding overfitting, we propose a semi-parametric Parzen-Gaussian 
model; following BKS + 06], we linearly transform then split our variables in two parts, one which 
is modelled through a Parzen windows estimator (where we assume the interesting part of the data 
lies), and one which is modelled as a multivariate Gaussian (where we assume the noise lies). Again, 
in Section 21 an EM procedure for estimating the linear transform may be naturally derived and 
leads to simple closed-form updates based on matrix inversions and eigenvalue decompositions. This 
procedure contains no hyperparameters, all the parameters being learnt from data. 

Since the EM formulation of LCA scales quadratically in the number of datapoints, making it 
impractical for large datasets, we introduce in Section [5] both a stochastic approximation and a 
subsampling technique allowing us to achieve a linear cost and thus to scale LCA to much larger 
datasets. 

Finally, in Section [6l we show empirically that our method leads to density estimators with higher 
test-likelihoods than natural competing methods, and that the metrics may be used within unsu- 
pervised learning techniques that rely on such metrics, like spectral clustering. 



2 Previous work 



Many authors aimed at learning a Mahalanobis distance suited for local learning. While some 
techniques required the presence of labelled data GRHS04 , XNJR02] IB J04] , others proposed ways to 
learn the metric in a purely unsupervised way, e.g., [ZMP04 who used the distance to the k-th nearest 
neighbour as the local scaling around each datapoint. Most of the other attempts at unsupervised 
metric learning were developed in the context of kernel density estimation, a.k.a. Parzen windows. 
The Parzen windows estimator (Par62| is a nonparametric density estimation model which, given n 
datapoints {xi, . . . , x„} in R d , defines a mixture model of the form p(x) = i Y^=i ^( x > x j> ^) where 
if is a kernel with compact support and parameters 9. We relax the compact support assumption 
and choose K to be the normal kernel, that is 



p(x) = - V^(x, Xj -,S) = -m-i(2n)-i f>xp 

3=1 3=1 



■^(x-x^S-Hx-x,.) 



where X is the covariance matrix of each Gaussian. As the performance of the Parzen windows 
estimator is more reliant on the covariance matrix than on the kernel, there has been a large body 
of work, originating from the statistics literature, attempting to learn this matrix. However, al- 
most all attempts are focused on the asymptotic optimality of the estimators obtained with little 
consideration for the practicality in high dimensions. Thus, the vast majority of the work is lim- 
ited to isotropic matrices, reducing the problem to finding a sing le scalar h |Ros56| IDui76| IRud82| 
ICGW83I iBowMl IPM901 ISJ91] . the bandwidth, and the few extensions to the non-isotropic cases are 
numerically expensive [DH051 IJH09] . 

An exception is the approach proposed in [VB02, , which is very similar to our method, as the authors 
learn the covariance matrix of the Parzen windows estimator using local neighbourhoods. However, 
their algorithm does not minimize a well-defined cost function, making it unsuitable for kernels 
other than the Gaussian one, and the locality used to compute the covariance matrix depends on 
parameters which must be hand-tuned or cross-validated. Also, the modelling of all the dimensions 



2 



using the Parzen windows estimator makes the algorithm unsuitable when the data lie on a high- 
dimensional manifold. In an extension to [VB02j , BLV06 uses a neural network to compute 
the leading eigenvectors of the local covariance matrix at every point in the space, then uses these 
matrices to do density estimation and classification. Despite the algorithm's impressive performance, 
it does not correspond to a linear reparametrisation of the space and thus cannot be used as a 
preprocessing step. 



3 Local Component Analysis 



Seeing the density as a mixture of Gaussians, one can easily optimize the covariances using the EM 
algorithm }DLR77j . However, maximizing the standard log-likelihood of the data would trivially 
lead to the degenerate solution where S goes to to yield a sum of Dirac distributions. One 
solution to that problem is to penalize some norm of the precision matrix to prevent it from going 
to infinity. Another, more compelling, way is to optimize the leave-one-out log-likelihood, where 
the probability of each datapoint Xj is computed under the distribution obtained when Xj has been 
removed from the training set. This technique is not new and has already been explored both in the 
supervised [GRHS04 1 IGR06] and in the unsupervised setting |Dui76j . However, in the latter case, 
the cross-validation was then done by hand, which explains why only one bandwidth parameter 
could be optimized^. We will thus use the following criterion: 



£(£) = -X>g[— j^Affaxj,?;) 

*=1 j^i 



(1) 



< est - ^ ^ Xij log J\f (xj, xj, S) + Aij log \j , (2) 



with the constraints Vi , Ylj-tj \j — 1- This variational bound is obtained using Jensen's inequality. 

The EM algorithm optimizes the right-hand side of Eq. ([2j by alternating between the optimisations 
of A and £ in turn. The algorithm is guaranteed to converge, and does so to a stationary point of 
the true function over E defined in Eq. ([I}. At each step, the optimal solutions are: 



X* 

\j - 



N{xi, Xj, E) 



if j 7^ i , otherwise 



(3) 
(4) 



The "responsibilities" X*j define the relative proximity of Xj to Xj (compared to the proximity of all 
the Xfc's to Xj) and E* is the average of all the local covariance matrices. 

This algorithm, which we coin LCA, for local component analysis, transforms the data to make it 
locally isotropic, as opposed to PCA which makes it globally isotropic. Fig. (JTJ) shows a comparison 
of PCA and LCA on the word sequence "To be or not to be" . Whereas PCA is highly sensitive to 
the length of the text, LCA is only affected by the local shapes, thus providing a much less distorted 
resultlf 

1 Most of the literature on estimating the covariance matrix discards the log-likelihood cost because of its sensitivity 
to outliers and prefers AMISE (see, e.g., IDH05I ). However, in all our experiments, the number of datapoints was 
large enough so that LCA did not suffer from the presence of outliers. 

2 Since both methods are insensitive to any linear reparametrisation of the data, we do not include the original 
data in the figure. 
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no lb be or not to be 



Figure 1: Results obtained when transforming "To be or not to be" using PCA (left) and LCA (right). 
Left: To make the data globally isotropic, PCA awkwardly compresses the letters horizontally. 
Right: Since LCA is insensitive to the spacing between letters and to the length of the text, there 
is no horizontal compression. 

First, one may note at this point that Manifold Parzen Windows |VB02j is equivalent to LCA with 
only one step of EM. This makes Manifold Parzen Windows more sensitive to the choice of the 
original covariance matrix whose parameters must be carefully chosen. As we shall see later in the 
experiments, running EM to convergence is important to get good accuracy when using spectral 
clustering on the transformed data. Second, it is also worth noting that, similarly to Manifold 
Parzen Windows, LCA can straightforwardly be extended to the cases where each datapoint uses 
its own local covariance matrix (possibly with a smoothing term), or where the covariance S* is the 
sum of a low-rank matrix and some scalar multiplied by the identity matrix. 

Not only may LCA be used to learn a linear transformation of the space, but it also defines a density 
model. However, there are two potential negative aspects associated with this method. First, in 
high dimensions, Parzen windows is prone to overfitting and must be regularized Sil86 j . Second, 
if there are some directions containing a small Gaussian noise, the local isotropy will blow them 
up, swamping the data with clutter. This is common to all the techniques which renormalise the 
data by the inverse of some variance. A solution to both of these issues is to consider a product of 
two densities: one is a low-dimensional Parzen windows estimator, which will model the interesting 
signal, and the other is a Gaussian, which will model the noise. 

4 LCA with a multiplicative Gaussian component 

We now assume that there are irrelevant dimensions in our data which can be modelled by a Gaussian. 
In other words, we consider an invertible linear transformation (Bq, Bl) j x of the data, modelling 
BqX as a multivariate Gaussian and Bjx through kernel density estimation, the two parts being 
independent, leading to p(x) tx p(i?Jx, i?Jx) = pg(Bqx)pl(BJx), where pa is a Gaussian and pl 
is the Parzen windows estimator, i.e., 

- ^BaBl^i -n)- l( Xi - x J ) T B i Bj(x i - Xj -) , 

with (Bq, Bl) a full-rank square matrix. Using EM, we can upper-bound the negative log-likelihood: 
- 2^1ogp( Xl ) < ti{BlC G B G ) + tr(BjC L B L ) - log \B G B^ + B L B T L \ , (5) 

i 

with 

C G = - Y^(xj - /i)(xj - /i) T , C L = - } Xij(xi - x,-)(xj - Xj ) T . 
n n 

The matrices Bq and Bl minimizing the right-hand side of Eq. (|SJ) may be found using the following 
proposition (see proof in the appendix): 



p(xi) oc 



\BgB c 



BlB \> 



4 



Proposition 1 Let B G £ M dxdl and B L £ R dxd2 , with d = di + d 2 and B — (B G ,B L ) £ R dxd 
invertible. Consider two symmetric positive matrices Mi and M 2 in M. dxd . The problem 

min tiB^MxBc + tr BJM 2 B l - \ogdet(B G B^ + B L Bj) (6) 
/>'..•/>'.■ 

has a finite solution only if Mi and M 2 are invertible and, if these conditions are met, reaches its 
optimum at 

B G = M[ 1/2 U + , B L = M[ 1/2 U_DZ 1/2 , 

1/2 1/2 

where U + are the eigenvectors of M 1 M 2 M X associated with eigenvalues greater than or equal 

1/2 1/2 

to 1, U- are the eigenvectors of M x M 2 M 1 associated with eigenvalues smaller than 1 and 

1/2 1/2 

D- is the diagonal matrix containing the eigenvalues of ' M 1 M 2 M 1 smaller than 1. 

The resulting procedure is described in Algorithm [TJ where all dimensions are initially modelled by 
the Parzen windows estimator, which empirically yielded the best results. 

Algorithm 1 LCA - Gauss 

Input: X (dataset), itcrMax (maximum number of iterations), v (regularisation) 
Output: Be (Gaussian part transformation), Bl (Parzen windows transformation) 
Cq cov(A) + vld {Initialize C to the global covariance} 

Ig <- C G h 

Bq = 0, Br, = cho^Cg, 1 ) {Assign all dimensions to the Parzen windows estimator} 
for iter = Piter Max do 



Mij exp 



(xi-Xj)^ BfB L (xj 



Ma <- 



A 



Mjj 



C L < 2 h vh 

[V,D] <— eig^cCLlc) {Eigendecomposition of IqClIg} 

t\ = max z D(z, z) < 1 {Cut-off between eigenvalues smaller and larger than 1} 
*+ = {Ah < t < d} , t- = {t\l < t < h} 
V+<-V(:,t+), V-^V{i,tJ), D_ = D(t_,t_) 

B L =I G V^DZ 1/2 , B G = I G V+ 
end for 



Relationship with ICA. Independent component analysis (ICA) can be seen as a density model 
where x = As and s has independent components (see, e.g., [HKO01 ). In the Parzen windows 
framework, this corresponds to modelling the density of s by a product of univariate kernel density 
estimators [BPR04 . This however causes two problems: first, while this assumption is appropriate 
in settings such as source separation, it is violated in most settings, and having a multivariate kernel 
density estimation is preferable. Second, most algorithms are dedicated to finding independent 
components which are non-Gaussian. In the presence of more than one Gaussian dimension, most 
ICA frameworks become unidentifiable, while our explicit modelling of such Gaussian components 
allows us to tackle this situation (a detailed analysis of the identifiability of our Parzen/Gaussian 
model is out of the scope of this paper). 

Relationship with NGCA. NGCA [BK S + 06] makes an assumption similar to ours (they rather 
assume an additive Gaussian noise on top of a low-dimensional non-Gaussian signal) but uses a pro- 
jection pursuit algorithm to iteratively find the directions of non-Gaussianity. Unlike in FastICA, the 
contrast functions used to find the interesting directions can be different for each direction. However, 
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like all projection pursuit algorithms, the identification of interesting directions gets much harder in 
higher dimensions, as most of them will be almost Gaussian. Our use of a non-parametric density 
estimator with a log-likelihood cost allows us to globally optimize all directions simultaneously and 
does not rely on the model being correct. Finally, LCA estimates all its parameters from data as 
opposed to NGCA which requires the number of non-Gaussian directions to be set. 

Escaping local optima. Though our model allows for the modification of the number of di- 
mensions modelled by the Gaussian through the analysis of the spectrum of C G ClC g , it is 
sensitive to local optima. It is for instance rare that a dimension modelled by a Gaussian is switched 
to the Parzen windows estimator. Even though the algorithm will more easily switch from the Parzen 
windows estimator to the Gaussian model, it will typically stop too early, that is model many dimen- 
sions using the Parzen windows estimator rather than the better Gaussian. To solve these issues, we 
propose an alternate algorithm, LCA- Gauss- Red, which explores the space of dimensions modelled 
by a Gaussian more aggressively using a search algorithm, namely: 

1. We run the algorithm LCA - Gauss for a few iterations (40 in our experiments); 

2. We then "transfer" some columns from Bl (the Parzen windows model) to Bq (the Gaussian 
model), and rerun LCA - Gauss using these new matrices as initialisations; 

3. We iterate step 2 using a dichotomic search of the optimal number of dimensions modelled by 
the Gaussian, until a local optimum is found; 

4. Once we have a locally optimum number of dimensions modelled by the Gaussian model, we 
run LCA - Gauss to convergence. 

5 Speeding up LCA 

Computing the local covariance matrix of the points using Eq. ([3]) and Eq. Q has a complexity in 
0(dn 2 + d 2 n + d 3 ), with d the dimensionality of the data and n the number of training points. Since 
this is impractical for large datasets, we can resort to sampling to keep the cost linear in the number 
of datapoints. We may further use low-rank or diagonal approximation to achieve a complexity 
which grows quadratically with d instead of cubically. 

5.1 Averaging a subset of the local covariance matrices 

Instead of averaging the local covariances over all datapoints, we may only average them over a 
subset of datapoints. This estimator is unbiased and, if the local covariance matrices are not too 
dissimilar, which is the assumption underlying LCA, then its variance should remain small. This 
is equivalent to using a minibatch procedure: every time we have a new minibatch of size B, we 
compute its local covariance Cl, which is then averaged with the previously computed Cl using 

C L <- 7 3c i + (i- 7 #)^ (7) 

to yield the updated Cl- The exponent B/n is so that 7, the discount factor, determines the weight 
of the old covariance matrix after an entire pass through the data, which makes it insensitive to 
the particular choice of batch size. As opposed to many such algorithms where the choice of 7 is 
critical as it helps retaining the information of previous batches, the locality of the EM estimate 
makes it less so. However, if the number of datapoints used to estimate Cl is not much larger than 
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the dimension of the data, we need to set a higher 7 to avoid degenerate covariance matrices. In 
simulations, we found that using a value of 7 = .6 worked well. Similarly, the size of the minibatch 
influences only marginally the final result and we found a value of 100 to be large enough. 

5.2 Computing the local covariance matrices using a subset of the data- 
points 

Rather than using only a subset of local covariance matrices, one may also wonder if using the 
entire dataset to compute these matrices is necessary. Also, as the number of datapoints grows, 
the chances of overfitting increase. Thus, one may choose to use only a subset of the datapoints to 
compute these matrices. This will increase the local covariances, yielding a biased estimate of the 
final result, but may also act as a regulariser. In practice, for very large datasets, one will want the 
largest neighbourhood size while keeping the computational cost tractable. 

Denoting rii the number of locations at which we estimate the local covariance and rij the number of 
neighbours used to estimate this covariance, the cost per update is now 0(d [n^ + rij] + driirij + d ). 
Since only rij should grow with n, this is linear in the total number of datapoints. 

Though they may appear similar, these are not "landmark" techniques (see, e.g., |DST04j ) as there 
is still one Gaussian component per datapoint, and the n, datapoints around which we compute the 
local covariances are randomly sampled at every iteration. 

6 Experiments 

LCA has three main properties: first, it transforms the data to make it locally isotropic, thus being 
well-suited for preprocessing the data before using a clustering algorithm like spectral clustering; 
second, it extracts relevant, non-Gaussian components in the data; third, it provides us with a good 
density model through the use of the Parzen windows estimator. 

In the experiments, we will assess the performance of the following algorithms: LCA, the original 
algorithm; LCA-Gauss, using a multiplicative Gaussian component, as described in Section[4] LCA- 
Gauss-Red, the variant of LCA-Gauss using the more aggressive search to find a better number of 
dimensions to be modelled by the Gaussian component. The MATLAB code for LCA, LCA-Gauss 
and LCA-Gauss-Red is available at http://nicolas.le-roux.name/code.html 

6.1 Improving clustering methods 

We first try to solve three clustering problems: one for which the clusters are convex and the 
direction of interest does not have a Gaussian marginal (Fig. J2j, left), one for which the clusters 
are not convex (Fig. ([2]), middle), and one for which the directions of interest have almost Gaussian 
marginals (Fig. @, right). Following [BH07, BJ04], the data is progressively corrupted by adding 
dimensions of white Gaussian noise, then whitened. We compare here the clustering accuracy, which 
is defined as ^ minp tt(EP) where E is the confusion matrix and P is the set of permutations over 
cluster labels, obtained with the following five techniques: 

1. Spectral clustering (SC) NJW01 on the whitened data (using the code of [CSB+11] '): 

2. SC on the projection on the first two components found by FastICA using the best contrast 
function and the correct number of components; 
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3. SC on the data transformed using the metric learnt with LCA; 

4. SC on the data transformed using the metric learnt with the product of LCA and a Gaussian; 

5. SC on the projection of the data found using NGCA |BKS+06| with the correct number of 
components. 

Our choice of spectral clustering stems from its higher clustering performance compared to K- 
means. Results are reported in Fig. ([3]). Because of the whitening, the Gaussian components in 
the first dataset are shrunk along the direction containing information. As a result, even with 
little noise added, the information gets swamped and spectral clustering fails completely. On the 
other hand, LCA and its variants are much more robust to the presence of irrelevant dimensions. 
Though NGCA works very well on the first dataset, where there is only one relevant component, 
its performance drops quickly when there are two relevant components (note that, for all datasets, 
we provided the true number of relevant dimensions as input to NGCA). This is possibly due to 
the deflation procedure which is not adapted when no single component can be clearly identified 
in isolation. This is in contrast with LCA and its variants which circumvent this issue, thanks to 
their global optimisation procedure. Note also that LCA-Gauss allows us to perform unsupervised 
dimensionality reduction with the same performance as previously proposed supervised algorithms 



Figure (j4]) shows the clustering accuracy on the three datasets for various numbers of EM iterations, 
one iteration corresponding to Manifold Parzen Windows [VB02 with a Gaussian kernel whose 
covariance matrix is the data covariance kernel. As one can see, running the EM algorithm to 
convergence yields a significant improvement in clustering accuracy. The performance of Manifold 
Parzen Windows could likely have been improved with a careful initialisation of the original kernel, 
but this would have been at the expense of the simplicity of the algorithm. 



Figure 2: Noise-free data used to assess the robustness of isT-means to noise. Left: mixture of two 
isotropic Gaussians of unit variance and means [— 3,0] T and [3, 0] T . Centre: two concentric circles 
with radii 1 and 2, with added isotropic Gaussian noise of standard deviation .1. Right: mixture of 
five Gaussians. The centre cluster contains four times as many datapoints as the other ones. 



6.2 LCA as a density model 

We now assess the quality of LCA as a density model. We build a density model of the USPS digits 
dataset, a 256-dimensional dataset of handwritten digits. We compared several algorithms: 

1. An isotropic Parzen windows estimator with the bandwidth estimated using LCA (replacing 
E* of Eq. (0| by XI so that the two matrices have the same trace); 



(e.g., [BJ04] 1 
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Noise dimensions Noise dimensions Noise dimensions 

H-RawSC -e-ICA -B-LCA -0-LCA-Gauss LCA-Gauss-Red-#-NGCA 

Figure 3: Average clustering accuracy (100% = perfect clustering, chance is 50% for the first two 
datasets and 20% for the last one) on 100 runs for varying number of dimensions of noise added. The 
error bars represent one standard error. Left: mixture of isotropic Gaussians presented in Fig. ^ 
(left). Centre: two concentric circles presented in Fig. ([2]) (centre). Right: mixture of five Gaussians 
presented in Fig. ([2]) (right). Figure best seen in colour. 




Noise dimensions Noise dimensions Noise dimensions 



Figure 4: Average clustering accuracy (100% = perfect clustering, chance is 50% for the first two 
datasets and 20% for the last one) on 100 runs for varying number of dimensions of noise added 
and varying number of EM iterations in the LCA algorithm (MPW = one iteration). The error 
bars represent one standard error. Left: mixture of isotropic Gaussians presented in Fig. ([2]) (left). 
Centre: two concentric circles presented in Fig. ^ (centre). Right: mixture of five Gaussians 
presented in Fig. ([2]) (right). Figure best seen in colour. 

2. A Parzen windows estimator with diagonal metric (equal to the diagonal of S* in Eq. (|4|; 

3. A Parzen windows estimator with the full metric as obtained using LCA; 

4. A single Gaussian model; 

5. A product of a Gaussian and a Parzen windows estimator (as described in Section [4j. 

The models were trained on a set of 2000 datapoints and regularized by penalizing the trace of 
(in the case of the last model, both covariance matrices, local and global, were penalized). The 
regularisation parameter was optimized on a validation set of 1000 datapoints. For the last model, 
the regularisation parameter of the global covariance was set to the one yielding the best performance 
for the full Gaussian model on the validation set. Thus, we only had to optimize the regularisation 
parameter for the local covariance. 
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The final performance was then evaluated on a set of 3000 datapoints which had not been used for 
training nor validation. We ran the experiment 20 times, randomly selecting the training, validation 
and test set each time. 

Fig. ([5]) shows the mean and the standard error of the negative log-likelihood on the test set. As one 
can see, modelling all dimensions using the Parzen windows estimator leads to poor performance 
in high dimensions, despite the regulariser and the leave-one-out criterion. On the other hand, 
LCA-Gauss and LCA-Gauss-Red clearly outperform all the other models, justifying our choice of 
modelling some dimensions using a Gaussian. Also, as opposed to the previous experiments, there 
is no performance gain induced by the use of LCA-Gauss-Red as opposed to LCA-Gauss, which we 
believe stems from the fact that the switch from one model to the other is easier to make when there 
are plenty of dimensions to choose from. The poor performance of LCA-Full is a clear indication of 
the problems suffered by Parzen windows in high dimensions. 



LCA - Isotropic 


269.78 ±0.18 


LCA - Diagonal 


109.59 ±0.56 


LCA - Full 


32.98 ±0.35 


Gaussian 


32.27 ±0.36 


LCA - Gauss 


19.09 ±0.39 


LCA - Gauss - Red 


19.09 ±0.39 



Figure 5: Test negative log-likelihood on the USPS digits dataset, averaged over 20 runs. 



6.3 Subsampling 

We now try to evaluate the loss in performance incurred by the use the subsampling procedure 
described in Section [5j both on the train and test negative log-likelihoods. For that purpose, we 
used the USPS digit recognition dataset, which contains 8298 datapoints in dimension 256, which 
we randomly split into a training set of n = 6000 datapoints, using the remaining 2298 as the test 
set. We tested the following hyperparameters: 

• Discount factor 7 = 0.3, 0.6, 0.9 , 

• Batch size B = 1000,3000,6000 , 

• Neighbourhood size N = 1000, 3000, 6000 . 

Fig. (J6j) show the log-likelihood differences induced by the use of smaller batch sizes and neighbour- 
hood sizes. For each set of hyperparameters, 20 experiments were run using different training and 
test sets, and the means and standard errors are reported. The results for 7 = 0.9 were very similar 
and are not included due to space constraints. 

Three observations may be made. First, reducing the batchsize has little effect, except when 7 
is small. Second, reducing the neighbourhood size has a regularizing effect at first but drastically 
hurts the performance if reduced too much. Third, the value of 7, the discount factor, has little 
influence, but larger values proved to yield more consistent test performance, at the expense of 
slower convergence. The consistency of these results shows that it is safe to use subsampling (with 
values of 7 = 0.6, B — 100 and N — 3000, for instance) especially if the training set is very large. 
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N = 1000 


N = 3000 


N = 6000 


N = 1000 


N = 3000 


N = 6000 


B = 1000 


6.43 ±0.10 


2.70 ± 0.06 


-0.10 ±0.03 


6.43 ±0.10 


2.80 ± 0.06 


-0.17 ±0.02 


B = 3000 


6.58 ±0.07 


2.73 ±0.05 


-0.06 ± 0.03 


6.54 ±0.07 


2.80 ± 0.06 


-0.11 ±0.02 


B = 6000 


6.22 ±0.08 


2.21 ±0.03 


0.00 ± 0.01 


6.18 ±0.07 


1.98 ± 0.03 


0.02 ± 0.02 





N = 1000 


N = 3000 


N = 6000 


N = 1000 


N = 3000 


N = 6000 


B = 1000 


2.12 ±0.16 


0.20 ±0.08 


0.65 ± 0.05 


2.01 ±0.17 


0.07 ±0.09 


0.15 ±0.03 


B = 3000 


2.11 ±0.16 


0.30 ±0.07 


0.46 ± 0.04 


2.01 ±0.17 


0.16 ±0.09 


0.09 ± 0.03 


B = 6000 


1.54 ±0.16 


-0.75 ±0.07 


-0.01 ±0.01 


1.43 ±0.15 


-1.07 ±0.08 


0.01 ±0.02 



Figure 6: Train (top) and test (bottom) negative log- likelihood differences induced by the use of 
smaller batch and neighbourhood sizes compared to the original model (7 = 0, B = 6000, N = 6000) 
for 7 = 0.3 (left) and 7 = 0.6 (right). A negative value means better performance. 



7 Conclusion 

Despite its importance, the learning of local or global metrics is usually an overseen step in many 
practical algorithms. We have proposed an extension of the general bandwidth selection problem to 
the multidimensional case, with a generalisation to the case where several components are Gaussian. 
Additionally, we proposed an approximate scheme suited to large datasets which allows to find a 
local optimum in linear time. We believe LCA can be an important preprocessing tool for algorithms 
relying on local distances, such as manifold learning methods or many semi-supervised algorithms. 
Another use would be to cast LCA within the mean-shift algorithm, which finds the modes of the 
Parzen windows estimator, in the context of image segmentation [CM02j . In the future, we would 
like to extend this model to the case where the metric is allowed to vary with the position in space, 
to account for more complex geometries in the dataset. 
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Appendix 

We prove here Proposition [1] 

Proof If Mi is singular, then the minimum value is —00, because we can have BqM\Bq bounded 
while BqBq tends to ±00 (for example, if d\ = 1, and Ui is such that M\U\ = 0, select Bq = Aui 
with A — > ±00). The reasoning is similar for Mi. 

We thus assume that M\ and M2 are invcrtible. We consider the eigendecomposition of M 1 1 MiM x 
?7Diag(e)J7 T , which corresponds to the generalized eigendecomposition of the pair (Mi,M 2 ). 
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Denoting A 2 = XJ T M\ l2 B L and A x = U T Ml /2 B G , we have: 

trB^AhBa + tr BjM 2 B L - logdet(B G Bj + B L Bj) 
= tr Aj Ax + tr Aj Diag(e)A 2 

-logdct(Ai^j r + A 2 Aj) +logdetMi 
= tr AjAx + tr A 2 Diag(e) A 2 - log det(A[ A 2 ) 

- log det ( Aj {I - A 2 ( Aj A 2 ) - 1 Aj ) A x ) + log det Mi . 

By taking derivatives with respect to A±, we get 

Ax = (I-Il 2 )Ax(Aj(I-Il 2 )Ax)- 1 , (8) 
with n 2 = A 2 (A2 A 2 )~ 1 A], . By left- multiplying both sides of Eq. © by A 2 , we obtain 

A^i = . 

By left-multiplying by Aj , we get 

AjAx =1. 

Thus, we now need to minimize with respect to A 2 the following cost function 



dx + tx-Aj Diag(e),4 2 - logdet(A 2 A 2 ) + log det Mi 

Let s be the vector of singular values of A 2 , ordered in decreasing order and let the e, be ordered in 
increasing order. We have: 

trDiag(e)v4 2 v4 2 r = - tr(- Diag(e) A 2 Aj) ^ ^e,s 2 ' 

i 

with equality if and only if the eigenvectors of A 2 A 2 T are aligned with the ones of Diag(e) (the — e, 
being also in decreasing order) (Theorem 1.2.1, |BL00j ). 

Thus, we have A 2 Aj = diag(s) 2 with only d 2 non-zero elements in s. Let J 2 be the index of non 
zero-elements. We thus need to minimize 

dx + log det Mi + (e 3 -s| - logs 2 ) , 

with optimum s 2 = ej 1 and value: 

dx + d 2 + log det Mi + ^ loge,- . 

Thus, we need to take J 2 corresponding to the smallest eigenvalues ej. If we also optimize with 
respect to d 2 , then J 2 must only contain the elements smaller than 1. ■ 
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